{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Statistical Test Specification for TestSuites"
   ],
   "metadata": {
    "id": "QztD6LMKrfn7"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "try:\n",
    "    import evidently\n",
    "except ImportError:\n",
    "    !npm install -g yarn\n",
    "    !pip install git+https://github.com/evidentlyai/evidently.git"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "from scipy.stats import mannwhitneyu\n",
    "from sklearn import datasets\n",
    "\n",
    "from evidently.calculations.stattests import StatTest\n",
    "from evidently.test_suite import TestSuite\n",
    "from evidently.tests import TestColumnDrift, TestShareOfDriftedColumns"
   ],
   "outputs": [],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "yPH0tLabtNpZ",
    "outputId": "d109922a-9eb8-4918-d01c-874e9cc58dc1"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "warnings.simplefilter('ignore')"
   ],
   "outputs": [],
   "metadata": {
    "id": "rlj43ygQtjGg"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Prepare Datasets"
   ],
   "metadata": {
    "id": "L8YdHdn9rmcg"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# Adult dataset, split by education level so that reference and current data differ\n",
    "adult_data = datasets.fetch_openml(name='adult', version=2, as_frame='auto')\n",
    "adult = adult_data.frame\n",
    "\n",
    "adult_ref = adult[~adult.education.isin(['Some-college', 'HS-grad', 'Bachelors'])]\n",
    "# Copy explicitly to avoid a SettingWithCopyWarning on the assignment below\n",
    "adult_cur = adult[adult.education.isin(['Some-college', 'HS-grad', 'Bachelors'])].copy()\n",
    "\n",
    "# Inject missing values into the first 2000 rows of two columns\n",
    "adult_cur.iloc[:2000, 3:5] = np.nan"
   ],
   "outputs": [],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 354
    },
    "id": "RfuVz0N5q-CR",
    "outputId": "c10ae41b-1a27-4366-a9b0-daa1d6ad7940"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Data Drift Options"
   ],
   "metadata": {
    "id": "9Xzh76_6t337"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "**Notes**: \n",
    "You can specify the statistical test (stattest) for individual columns or for the whole dataset through the test parameters:\n",
    "\n",
    "* `stattest`: sets the drift detection test for all columns (or for the single column in `TestColumnDrift`)\n",
    "* `num_stattest`: sets the drift detection test for numerical columns only\n",
    "* `cat_stattest`: sets the drift detection test for categorical columns only\n",
    "* `per_column_stattest`: sets the drift detection test per column, as a dictionary mapping column names to tests\n",
    "\n",
    "**Available stattests**:  \n",
    "* 'ks' \n",
    "* 'z' \n",
    "* 'chisquare' \n",
    "* 'jensenshannon' \n",
    "* 'kl_div' \n",
    "* 'psi' \n",
    "* 'wasserstein'\n",
    "* 'anderson'\n",
    "* 'fisher_exact'\n",
    "* 't_test'\n",
    "* 'cramer_von_mises'\n",
    "* 'g_test'\n",
    "* 'empirical_mmd'\n",
    "* 'TVD'\n",
    "\n",
    "You can also implement a custom drift test. Define a function that takes the reference and current data as two `pd.Series`, plus the feature type and a threshold, and returns a tuple of the test result (e.g. a p-value or distance) and a boolean drift flag. Then wrap the function in a `StatTest` object and pass it as a stattest parameter.\n",
    "\n",
    "**Usage**:\n",
    "- TestSuite(tests=[TestColumnDrift(column_name='name', stattest=custom_stattest)])"
   ],
   "metadata": {
    "id": "Op_n6pDNrfn9"
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Setting the stattest for the whole dataset"
   ],
   "metadata": {
    "id": "ejDzWf1Xrfn-"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_column_tests = TestSuite(tests=[\n",
    "    TestColumnDrift(column_name='education-num'),\n",
    "    TestColumnDrift(column_name='education-num', stattest='psi')\n",
    "])\n",
    "\n",
    "data_drift_column_tests.run(reference_data=adult_ref, current_data=adult_cur)\n",
    "data_drift_column_tests"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_dataset_tests = TestSuite(tests=[\n",
    "    TestShareOfDriftedColumns(stattest='psi'),\n",
    "])\n",
    "\n",
    "data_drift_dataset_tests.run(reference_data=adult_ref, current_data=adult_cur)\n",
    "data_drift_dataset_tests"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Setting the stattest for numerical and categorical features"
   ],
   "metadata": {
    "id": "It_sFjB1scv6"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_dataset_tests = TestSuite(tests=[\n",
    "    TestShareOfDriftedColumns(num_stattest='psi', cat_stattest='jensenshannon'),\n",
    "])\n",
    "\n",
    "data_drift_dataset_tests.run(reference_data=adult_ref, current_data=adult_cur)\n",
    "data_drift_dataset_tests"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Setting the stattest for individual features"
   ],
   "metadata": {
    "id": "OGfoOUjsrfn_"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "per_column_stattest = {x: 'wasserstein' for x in ['age', 'education-num']}\n",
    "\n",
    "for column in ['sex', 'class']:\n",
    "    per_column_stattest[column] = 'z'\n",
    "\n",
    "for column in ['workclass', 'education']:\n",
    "    per_column_stattest[column] = 'kl_div'\n",
    "\n",
    "for column in [ 'relationship', 'race',  'native-country']:\n",
    "    per_column_stattest[column] = 'jensenshannon'\n",
    "\n",
    "for column in ['fnlwgt','hours-per-week']:\n",
    "    per_column_stattest[column] = 'anderson'\n",
    "\n",
    "for column in ['capital-gain','capital-loss']:\n",
    "    per_column_stattest[column] = 'cramer_von_mises'"
   ],
   "outputs": [],
   "metadata": {
    "id": "Fytttww-xUH4"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "per_column_stattest"
   ],
   "outputs": [],
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
    },
    "id": "mzVFW2FqXpjm",
    "outputId": "cee0c64a-9835-4be6-c70b-7291be54065e"
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_dataset_tests = TestSuite(tests=[\n",
    "    TestShareOfDriftedColumns(per_column_stattest=per_column_stattest),\n",
    "])\n",
    "\n",
    "data_drift_dataset_tests.run(reference_data=adult_ref, current_data=adult_cur)\n",
    "data_drift_dataset_tests"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Custom drift detection test"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "def _mann_whitney_u(reference_data: pd.Series, current_data: pd.Series, _feature_type: str, threshold: float):\n",
    "    # Mann-Whitney U test: drift is flagged when the p-value falls below the threshold\n",
    "    p_value = mannwhitneyu(np.array(reference_data), np.array(current_data))[1]\n",
    "    return p_value, p_value < threshold\n",
    "\n",
    "# Wrap the function in a StatTest so it can be passed as a stattest parameter\n",
    "mann_whitney_stat_test = StatTest(\n",
    "    name=\"mann-whitney-u\",\n",
    "    display_name=\"mann-whitney-u test\",\n",
    "    func=_mann_whitney_u,\n",
    "    allowed_feature_types=[\"num\"]\n",
    ")"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "data_drift_dataset_tests = TestSuite(tests=[\n",
    "    TestShareOfDriftedColumns(num_stattest=mann_whitney_stat_test),\n",
    "])\n",
    "\n",
    "data_drift_dataset_tests.run(reference_data=adult_ref, current_data=adult_cur)\n",
    "data_drift_dataset_tests"
   ],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "stat_test_specification_for_data_drift_adult.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7 (tags/v3.10.7:6cc6b13, Sep  5 2022, 14:08:36) [MSC v.1933 64 bit (AMD64)]"
  },
  "vscode": {
   "interpreter": {
    "hash": "f1fdbb9839a2a71583b007f6f8ccc2efefb09edbe218b32fc0a8118d70971461"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}